Method and apparatus for recovering from mispredicted branches in a pipelined processor



(57) A method and apparatus for recovering from
mispredicted branches in a pipelined, multiple-functional-unit
processor. One functional unit has a shorter pipeline
than the other; the missing stages are compensated
for by the FIFO "annex". Each entry in the annex corresponds
to a result to be written out to the register file,
and has a "young bit" associated with it. The "young bit",
when set, indicates that the entry is the most recently
calculated version of the corresponding register. New
instructions that reference the register use the "young
bit" information to ensure that they get the most up-to-date
data. If a mispredicted branch is taken, the "young
bit" information must be rolled back; this can be done
either by going through the annex from oldest to youngest,
resetting the "young bits" of all younger matching
entries, or by keeping one or more past versions of the
young bits for each entry, then copying the appropriate
past young bit to the current young bit. A combination
of both approaches is also possible.
Description

BACKGROUND OF THE INVENTION 

1. FIELD OF THE INVENTION 

The present invention relates to microprocessor de- 
sign. Specifically, the present invention relates to recov- 
ery from mispredicted branches of speculatively execut- 
ed instructions in a pipelined processor. 

2. DISCUSSION OF THE PRIOR ART 

A typical early microprocessor included a central 
processing unit (CPU) which implemented a machine 
capable of interfacing with memory and serially executing
a sequence of instructions. The instruction execution
was typically broken into at least four major stages:
instruction and operand fetch, instruction decode, execution,
and write back of results into the destination registers.
A typical instruction took one clock cycle to execute,
so that each of the four functions was performed
in that cycle. Each stage had to wait for the results from 
the previous stage before its work could be accom- 
plished. Thus, the instruction execution propagated 
through each of the four stages in order. The minimum 
clock period was then determined by the longest possi- 
ble propagation delay through all four stages. 

The concept of pipelining increased the maximum 
clock frequency by reducing the amount of logic per- 
formed in each clock cycle. To facilitate this, for exam- 
ple, the interface between the second and third stages 
could be separated by clocked latches. The first two 
stages (fetch and decode) would be performed in one 
clock cycle. Subsequently, during a second clock cycle, 
the last two stages (execution and write back) would be 
performed. Here, the overall latency of an instruction 
might remain approximately the same since the total 
amount of time from the beginning of the fetch to the 
end of the write back would be approximately the same. 
However, separating the instruction execution into two 
distinct pieces has the important advantage that the
throughput could be increased by as much as a factor
of two. This is a result of the fact that the pipelined CPU
can operate on two instructions simultaneously. While
the execution and write back of one instruction is being
performed, the fetch and decoding of a second instruction
can be performed. Quite naturally, this pipelining
concept can be extended such that each of the four stages
is performed in a separate clock cycle, thus increasing
the throughput accordingly. Thus, by dividing the logic
into N separate segments, the throughput can theoretically
be increased by a factor of N.

Superpipelined designs break up the logic in some
or all of the four stages so as to reduce the maximum
propagation delay through any one stage, and thus increase
the operating frequency and throughput as the
instruction execution is broken into more than four pipeline
stages. A superscalar microprocessor has more
than one execution unit 10 (also called a functional unit)
as shown in Figure 1.

A superscalar processor has several parallel functional
units 10. Typical superscalar processors include
floating point, integer, branch, and load/store functional
units. Typically, among the functional units 10 are units
which can perform floating point calculations or similar
complex operations. It is desirable to run these complex
functional units at the same clock frequency as the rest
of the hardware, while still allowing for each functional
unit to begin executing a new instruction every cycle. To
accomplish these objectives, pipelining of the parallel
functional units 10 is desirable. The complexity and logical
partitioning of the most complicated functional unit
dictates the number of pipeline stages necessary in that
functional unit. Not all of the functional units 10 have the
same latency. Of these four parallel functional units 10,
the floating point unit will probably turn out to have the
most complexity. Because the other three types of functional
units are not as complex, it is possible to pipeline
these other functional units into fewer stages than are
required for the pipelining of the floating point unit. All
of the parallel functional units 10 merge back into the
final write back stage 11 in which the results of the
executions are written into their respective destination
registers.

If the pipeline for every parallel functional unit 10 is
not the same number of stages, then the results from
some functional units 10 will be available sooner than
others. For example, if the pipelining of the floating point
unit requires five stages, while the pipelining for the integer
unit only takes two stages, then the results from
the integer unit would be available three clock cycles
prior to the results of the floating point unit even though
both instructions were dispatched concurrently. By the
same token, "younger" instructions can finish sooner
than older instructions. For example, during one clock
cycle, a floating point instruction is dispatched, and during
the next subsequent clock cycle, an integer instruction
such as an addition is dispatched. If the integer pipeline
is three stages shorter than the floating point pipeline,
the integer addition result will be available two clock
cycles before the floating point result even though the
floating point instruction was dispatched first. In this example,
the integer addition was a "younger" instruction
because it was dispatched later than the "older" floating
point instruction.
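
The timing in the two examples above can be checked with a short calculation. The following Python sketch is illustrative only and is not part of the original disclosure; it assumes that every pipeline stage takes exactly one clock cycle, and the names result_cycle, FP_STAGES and INT_STAGES are hypothetical.

```python
def result_cycle(dispatch_cycle: int, pipeline_stages: int) -> int:
    """Cycle in which a result becomes available, assuming one stage per clock."""
    return dispatch_cycle + pipeline_stages


FP_STAGES = 5   # floating point pipeline depth used in the text
INT_STAGES = 2  # integer pipeline depth used in the text

# Both instructions dispatched in cycle 0: the integer result leads by three cycles.
same_cycle_lead = result_cycle(0, FP_STAGES) - result_cycle(0, INT_STAGES)

# Floating point dispatched in cycle 0, integer add in cycle 1:
# the younger integer result still leads by two cycles.
younger_lead = result_cycle(0, FP_STAGES) - result_cycle(1, INT_STAGES)

print(same_cycle_lead, younger_lead)  # 3 2
```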

If some younger instructions are allowed to write into
the destination registers before some older instructions,
potential problems arise. For example, if both instructions
write to the same destination register, the programmer
expects the younger instruction to write over
the results of the older instruction. If, in fact, the older
instruction writes over the results of the younger instruction,
the processor has not correctly executed the sequential
program since the intended result does not appear
in the intended destination. Therefore, it is important
to maintain the sequential nature of instruction result
writeback.

In order to facilitate sequential write-back of instruction
results, shorter functional unit pipelines are lengthened
with extra pipeline stages so that all functional
units 10 have the same latency from dispatch to write 
back. In shorter functional unit pipelines, this results in 
several extra stages being added to the end of the func- 
tional unit pipeline, as depicted in Figure 2. These addi- 
tional pipeline stages effectively form a first-in-first-out 
(FIFO) buffer 20, sometimes referred to as a completion 
unit, which will herein be referred to as an annex 20. 

Although preserving the sequential nature of the 
program execution and write-back, the addition of the 
extra stages in the pipelines of the shorter functional unit 
pipelines creates additional complications. Assume the 
shorter functional unit 21 in Figure 2 is an integer ALU 
functional unit which might be used to execute a sequence
of instructions. For example, consider the following
program fragment where add rs1,rs2,rd denotes
that the contents of register rs1 and the contents of register
rs2 should be added and the result should be stored
to register rd.

add r1,r2,r3;

add r3,r4,r5;

Here, the first instruction creates a
result for register r3. The very next instruction uses the
value for r3 that was computed by the previous instruction.
Unfortunately, however, the value of r3 which was
calculated by the first instruction has not been written
into the register file by the time the second instruction
begins execution. The destination register in the register
file is not finally updated until the write-back stage 22 is 
reached at the end of the annex 20. Therefore, the cor- 
rect operand of the second instruction is resident in the 
first entry 23 of the annex 20 when the second instruc- 
tion is executing. In order to allow the entries of the an- 
nex 20 to be utilized by subsequent instructions, some 
access to the annex 20 must be provided. 

Figure 3 depicts one way that access to the annex 
30 is provided in the prior art. Each entry in the annex 
30 has a set of outputs 31 which is fed back to at least 
one multiplexor 32 which selects either the value 
fetched from the register file or the value from one of the 
entries in the annex 30. If an entry in the annex 30 con- 
tains the most up-to-date or "youngest" version of a var- 
iable which needs to be used as an operand in a current 
instruction, that entry can be selected by the multiplexor
32 to provide the input to the functional unit.

When each entry of the annex 30 is fed back to a 
multiplexor 32 at the beginning of each functional unit 
33, the width of the datapath 34 can increase substan- 
tially because the number of feedthroughs increases di- 
rectly proportional to the number of entries in the annex 
30. Since each entry in the annex 30 has its own set of 
wires 31 connecting back to the multiplexor 32 before
the functional unit 33, a total of N extra wires per bit in
the datapath width must pass through the functional unit
33 for an N-entry annex 30. If the hardware has a 64-bit
wide architecture and has an eight-deep annex, this
means that 512 extra wires must be routed through or
fed-through the functional unit 33. Since each wire takes
up some non-zero amount of space, the overall width of
the functional unit 33 must be increased substantially.
In order to avoid a "waterfall routing" situation which increases
the size of the datapath 34 enormously, the
pitch or width of each entry of the annex 30 must similarly
be increased to match the pitch of the functional
unit 33. Thus, the entire datapath is widened by the addition
of the feedthroughs. Widening of the datapath is
undesirable for at least two reasons. First, it consumes
more valuable area on the integrated circuit. Secondly,
by making the datapath wider, the control signal wires
are longer. Since the resistance of wires is not negligible
with modern small feature size processes, longer wires
are also slower, and thus can constrain performance.

A second disadvantage of feeding back the entries 
30 to a multiplexor 32 is that the multiplexor 32 becomes
slower as the number of inputs increases. For each entry
in the annex 30, an additional input to the multiplexor
must be provided. To make matters worse, a superscalar
processor with several pipelined functional units
may have multiple annexes. Since the youngest entry
may be in any annex 30, the multiplexor 32 must have
an input for every possible entry in any of the annexes.
If too many inputs to the multiplexor are added, the multiplexor
becomes too slow to incorporate into the cycle
dedicated to execution, and the maximum operating frequency
is reduced.

Using this method, some capability must be provid- 
ed to determine which register addresses are stored in 
each entry of the annex 30 in order to determine which 
entry, if any, should be selected by the multiplexor 32.

A more advantageous prior art method involves the
use of a memory 40 to implement the annex as depicted
in Figure 4 and disclosed in "United States Patent Application
For Temporary Pipeline Register File For A Superpipelined
Superscalar Processor," Serial Number
08/153,814, assigned to the same assignee as the
present invention. Instead of feeding through each entry
of the annex, only one set of feedthroughs 41 is provided
to a two-input multiplexor 42. This eliminates the increase
in pitch of the functional unit 43 regardless of the
number of entries in the annex 40. In addition, the
number of inputs to the multiplexor 42 remains two regardless
of the number of entries in the annex 40.

Similar to the previously described method, some
capability is provided to determine which entry, if any,
should be provided to the multiplexor 42. One way to do
this involves the use of content addressable memory in
the address field of the annex. Each entry in the annex
stores an address field and a data field. The address
field contains the address of the destination register,
while the data field holds the results to be stored to that
destination register. When attempting to determine if
one of the operands is in the annex, the address of that
operand is simultaneously compared to all of the addresses
stored within the annex 40. If there is a match,
the match line for that entry is asserted. It is possible
that there may be more than one match in the annex 40
for any given operand. This would happen, for instance,
if several instructions in a sequence all wrote to the
same destination register. In that case, it is important
that the most up-to-date or "youngest" entry in the annex
40 be sent back to the multiplexor 42.

Most instructions require more than one operand. 
For instance, an add x,y,z instruction performs an addition
of the contents of x to the contents of y and stores
the result in z; thus x and y are operand registers. Because
it is possible that data destined for both x and y
are in the annex 40, more than one compare port and
set of match lines in the content addressable memory
should be provided to allow simultaneous look-up for all
the operands.

In order to resolve conflicts between multiple 
matches, a priority encoder circuit is used to allow only 
the youngest match to drive the bus connected to the 
multiplexor 42. However, the latency of a priority encod- 
er increases as the number of entries in the annex 40 
increases. Furthermore, the priority encoding is serial to 
the selection of the operands, and therefore cannot be 
removed from the critical path. Thus, the priority encoder
can add a prohibitive amount of delay.

The use of a special "young bit" is an alternative to 
using a priority encoder. As depicted in Figure 5, each 
entry in the annex 50 has an additional bit 51 which is 
designated as the young bit 51 for that entry. For any 
given destination register address, there is only one en- 
try in the annex whose young bit 51 is non-zero. That 
entry having a "true" logic value in the young bit 51 po- 
sition contains the youngest and most up-to-date ver- 
sion of the data to be stored in that given destination 
register. All older versions for that given register address 
have deasserted young bits. Each entry also has a valid 
bit 52, which indicates whether or not the entry contains 
valid data which should be written to the destination reg- 
ister. 

Using the young bit scheme, the match line for each 
entry in the annex is logically ANDed with the young bit 
and valid bit to create a "young match" signal for that 
entry. If there are multiple matches in the annex when 
an operand register address is applied to the content 
addressable address portion 53 of the annex, only one 
young match signal will be asserted at any given time. 
The increase in speed using the young bit scheme over 
the priority encoder scheme is at the expense of in- 
creased complexity in maintaining and updating the 
young bits 51. 
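
A minimal behavioral sketch in Python of the young match signal is given below for illustration. It is not a hardware description and is not taken from the disclosure; the Entry class, the function names, and the convention that index 0 holds the youngest entry are assumptions introduced here. The sketch compares the young-bit selection with the result that a priority encoder would produce for the same annex contents.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    addr: int    # destination register address (content addressable field 53)
    data: int    # result waiting to be written back
    young: bool  # young bit 51
    valid: bool  # valid bit 52


def young_match_lookup(annex: List[Entry], operand_addr: int) -> Optional[int]:
    """Return the forwarded data, or None if the register file must supply it."""
    # young match = match line AND young bit AND valid bit, per entry
    hits = [e for e in annex if e.addr == operand_addr and e.young and e.valid]
    assert len(hits) <= 1, "at most one young match may be asserted"
    return hits[0].data if hits else None


def priority_lookup(annex: List[Entry], operand_addr: int) -> Optional[int]:
    """Reference behavior of a priority encoder: youngest valid match wins."""
    for e in annex:  # index 0 is assumed to be the youngest entry
        if e.valid and e.addr == operand_addr:
            return e.data
    return None


annex = [
    Entry(addr=3, data=30, young=True, valid=True),   # youngest write to register 3
    Entry(addr=5, data=50, young=True, valid=True),
    Entry(addr=3, data=31, young=False, valid=True),  # older write to register 3
]

print(young_match_lookup(annex, 3), priority_lookup(annex, 3))  # 30 30
```

In this model both functions return the same value; the difference in hardware is that the young match needs no serial scan across the entries.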

A typical scheme for managing and updating the
young bits 51 requires a comparison port in the content
addressable memory portion 53 of the annex. The destination
address for each instruction being executed in
the functional unit is compared to the address portion of
each entry 54 in the annex 50. For each entry 54 with a
match, the young bit 51 is reset. Every new entry 55 in
the annex 50 always enters the annex pipeline 50 with
an asserted young bit 51, because the result of an instruction
that has just been executed is always the
youngest value for that destination register.
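
The update rule described above can be sketched behaviorally as follows. This is an illustrative Python model under stated assumptions, not the circuit itself: the Entry class and the shift_in function are hypothetical names, index 0 is taken to be the youngest entry, and instructions that produce no result are not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    addr: int           # destination register address
    young: bool = True  # a new entry always arrives with its young bit set
    valid: bool = True


def shift_in(annex: List[Entry], dest_addr: int, depth: int) -> List[Entry]:
    """Reset matching young bits, then shift the new result into the annex."""
    for e in annex:
        if e.addr == dest_addr:  # compare port hit on an older entry
            e.young = False
    annex.insert(0, Entry(addr=dest_addr))  # index 0 is the youngest entry
    del annex[depth:]  # the oldest entry leaves for write-back
    return annex


annex: List[Entry] = []
for dest in (3, 5, 3):  # two writes to register 3 with a write to register 5 between
    shift_in(annex, dest, depth=4)

print([(e.addr, e.young) for e in annex])
# [(3, True), (5, True), (3, False)]
```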

A fundamental difficulty in any pipelined processor
involves the handling of conditional branches in the execution
sequence. Most programs include frequent conditional
branches. When executing such a program in a
pipeline, there is no way to tell with certainty which instruction
should be fetched for execution directly after
the branch instruction because the execution sequence
depends upon the results of the branch instruction.
Since, in a pipelined processor, any instruction is not
executed until several clock cycles after it is fetched,
there is no way to decide with certainty which instruction
should be fetched, dispatched, or for which execution
should be started after the branch instruction.

A common way to handle this problem is to choose 
one possible execution sequence, hoping that the con- 
dition will be resolved such that the choice turns out to 
have been correct. In the case that the choice is incor- 
rect, the processor must be able to recover from the mis- 
prediction. In many processors, this recovery calls for 
the voiding or aborting execution of all the instructions 
that were fetched, dispatched, or for which execution 
was started along the mispredicted execution path. All 
stages prior to the stage when the branch becomes re- 
solved must be voided. 

In a processor with an annex 50 that uses a young 
bit 51, however, the problem is more complicated. When
a processor encounters a conditional branch, it must
choose which segment of the code to prefetch. The
prefetched instructions are subsequently executed
speculatively until the branch is actually resolved. The 
branch resolution latency can be several cycles depend- 
ing upon the degree of pipelining. Thus, it is possible for 
an instruction to be executed, the young bit 51 set in the 
annex 50 for the entry 55 containing that instruction's 
result, and then later it is determined that the branch 
prediction is incorrect. A significant consequence of set- 
ting the young bit 51 for a speculatively executed in- 
struction is that the previously asserted young bit 51 for 
an entry 54 having the same destination address is 
cleared. In attempting to recover from the mispredicted 
branch, the previous values of the young bits 51 which 
existed prior to the mispredicted branch must be re- 
stored to the entries 54 of the annex 50. 

Providing the capability to restore the previous state 
of the young bits 51 is a rather complicated task. One 
way to provide this capability is the maintenance of an 
elaborate pointer table. In this scheme, each annex en- 
try 54 has a pointer which points to the next older entry 
54 which shares the same destination register address. 
Since multiple assignments to the same register address
may occur within a mispredicted branch, even the
annex entries 54 with a reset young bit 51 must point to
the next older entry 54 for that destination address,
since it may be necessary to traverse more than one link
of the list to find the youngest result which was not the
result of an erroneous speculative execution. This solution
is very expensive in terms of the hardware required
to implement it.

SUMMARY OF THE INVENTION 

The present invention is an apparatus and a method 
for performing restoration of the previous values of the 
young bits after either a mispredicted branch has been 
detected and speculative instruction results have been 
voided or a predicated instruction has been speculatively
executed while its predicate indicates it should not
have been executed.

According to a first embodiment of the present in- 
vention, after a mispredicted branch has been detected, 
the annex entries containing the speculative instruction
results are invalidated. Beginning with the oldest annex
entry, the destination register address of that entry is 
broadcast to all the other entries. The broadcast entry 
is simultaneously compared to the addresses of all the 
other entries. All annex entries with matching addresses 
have their young bits reset, while the young bit of the 
broadcast entry is set. The above broadcast, compare,
reset, and set operations are performed on all the re- 
maining valid entries sequentially and in order of de- 
creasing age. When all valid entries have been broad- 
cast, the young bit states are correctly reconstructed. 
This embodiment reconstructs the young bits using ex- 
isting logic in the annex, and thus incurs no additional 
hardware costs. 

According to a second embodiment, each annex
entry has a current young bit (Y0), and one or more past
young bits (Y1 through YN). For branch condition resolutions
which take up to N clock cycles, N past young
bits (Y1 through YN) are maintained in each annex entry.
During every machine cycle in which a new annex
entry is received, the current young bit (Y0) is shifted
into the first past young bit position (Y1). Simultaneously,
the contents of Y1 are shifted into Y2. Similarly, the
contents of Yk are shifted into Yk+1 for all k from 0 to
N-1. Also, when a new annex entry is received, the
current young bits (Y0) are updated.

When a mispredicted branch is detected after the 
results of i speculative instructions have been entered 
into the annex, the past young bit in Yi is restored back 
into the current young bit Y0 for each annex entry. Thus,
restoration of the correct young bits is performed in a 
small constant time. 
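
The shifting and restoration of the past young bits can be sketched behaviorally in Python as follows. This is a minimal model under stated assumptions and not the disclosed circuit: N_PAST, Entry, shift_in and restore are hypothetical names, index 0 of the annex is taken to be the youngest entry, the fixed annex depth and write-back are omitted, the past bits of a newly received entry are assumed to start cleared, and restore expects 1 <= i <= N_PAST.

```python
from dataclasses import dataclass, field
from typing import List

N_PAST = 2  # assumed maximum number of speculative results before resolution


@dataclass
class Entry:
    addr: int
    valid: bool = True
    # y[0] is the current young bit Y0; y[1..N] are the past young bits Y1..YN
    y: List[bool] = field(default_factory=lambda: [True] + [False] * N_PAST)


def shift_in(annex: List[Entry], dest_addr: int) -> None:
    """Receive one new annex entry: age the history, then update Y0."""
    for e in annex:
        e.y = [e.y[0]] + e.y[:-1]  # Yk -> Yk+1 for k = 0..N-1
        if e.addr == dest_addr:
            e.y[0] = False         # ordinary young bit reset on a match
    annex.insert(0, Entry(addr=dest_addr))  # index 0 is the youngest entry


def restore(annex: List[Entry], i: int) -> None:
    """Undo the effect of the i youngest (speculative) entries."""
    for e in annex[:i]:
        e.valid = False            # flush the speculative results
    for e in annex:
        e.y[0] = e.y[i]            # copy past young bit Yi back into Y0


annex: List[Entry] = []
shift_in(annex, 3)   # non-speculative write to register 3
shift_in(annex, 5)   # non-speculative write to register 5
shift_in(annex, 3)   # speculative write to register 3 clears the older young bit
restore(annex, 1)    # misprediction detected after one speculative result

print([(e.addr, e.valid, e.y[0]) for e in annex])
# [(3, False, False), (5, True, True), (3, True, True)]
```

The restore step touches every entry once and does not depend on how many valid entries remain, which is the constant-time property noted above.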

Both embodiments may be implemented in the 
same processor to optimize the recovery from both long 
latency mispredictions and short latency mispredictions. 
The first embodiment is more suitable for long latency
conditionals, while the second embodiment is more suitable
for short latency conditionals. The preferred recovery
method can be determined dynamically by the processor
or statically based on the nature of the recovery.



BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 illustrates a typical superscalar processor 
architecture. 

Figure 2 illustrates a two-scalar execution block diagram
in which one functional unit has less latency and
thus has an annex. 

Figure 3 depicts a prior art manner of providing the 
functional units with access to the contents of each entry 
in the annex. 

Figure 4 depicts a second prior art manner of pro- 
viding the functional units with access to the contents of 
each entry in the annex. 

Figure 5 shows the organization of data in an eight 
entry annex. 

Figure 6 shows a possible state of the annex when 
a mispredicted branch has been detected. 

Figure 7 is a flowchart of the steps used to imple- 
ment the first embodiment of the present invention. 

Figure 8 shows the organization of data in an eight
entry annex storing N previous young bit states
according to a second embodiment of the present in- 
vention. 

Figure 9 is a flowchart of the steps used to imple- 
ment the second embodiment of the present invention. 

DETAILED DESCRIPTION OF THE INVENTION 

The present invention allows a relatively convenient 
way to perform restoration of the values of the young 
bits 51 in the entries of the annex prior to the mispredicted
executions. Figure 5 shows the data organization
in the annex. The annex 50 can be implemented as a 
shift register. The results of newly executed instructions 
are shifted into leftmost entry 55 of the annex, and the 
oldest entries in the rightmost entry 56 are shifted out 
to the writeback stage. Each entry 54 stores the desti- 
nation register address 53 and the data 57 to be stored 
to that register. Additionally, each entry 54 has a young 
bit 51 and a valid bit 52. 

The valid bit 52 is used to indicate whether the data 
should actually be written into the register when the 
writeback stage is reached at the end of the annex pipe- 
line 50. If the valid bit 52 for an annex entry 54 is set, 
the data 57 in that entry 54 is meaningful and will be 
stored to the register indicated by the address field 53.
If the valid bit 52 is reset, however, the data 57 in that 
entry 54 is not meaningful and will not be stored to the 
register indicated by the address field 53. Not all instructions
produce results which need to be written into a register;
however, the entries of the annex must shift forward
during every cycle so as to maintain a fixed time
interval between execution and write back. Therefore,
whenever an instruction is executed which produces no
result, the youngest annex entry is marked invalid. The
valid bit 52 provides a convenient way to flush the results
of instructions which were executed as part of a mispredicted
branch. Thus, deasserting the valid bit is a way
to nullify entries. If the condition is resolved such that
the execution of a sequence of instructions is an error,
the results of those instructions can be invalidated or
nullified simply by deasserting the valid bits 52 for those
instructions.

In the following code fragment, a conditional branch 
occurs. The instruction "branch_0 14, target" means that
if the contents of register 14 are non-zero, the next instruction
to be executed should be the instruction at the target
address. On the other hand, if the contents of register
14 are zero, then execution should continue sequentially
with the next instruction after the branch instruction.

nor 11,14,12

add 11,12,13

or 11,13,10

branch_0 14, target

move 11,13

add 12,15,17

target: add 13,14,16
add 14,12,15

Assume that the processor fetches the instructions 
directly after the conditional branch instruction. Furthermore,
assume that these instructions are speculatively
executed. All of the instructions except the branch result
in the assignment of a value to a register. Since the 
move instruction makes a second and later assignment 
to register 13, the young bit 51 for the previous assign- 
ment to register 13 has been deasserted. Since there are 
no older assignments to any of the other registers in the
annex 50, all the other young bits 51 are asserted. 

If the contents of register 14 turn out to be non-zero, 
then the speculative executions of the move and add 
instructions need to be reversed or voided. First, the val- 
id bits 52 for the youngest two entries in the annex are 
deasserted. As a consequence of deasserting the valid
bits 52 for these results, they are effectively flushed from 
the annex 50, since the write-back stage will not act up- 
on them. The next task which must be performed is the 
restoration of the young bits 51 to the state in which they 
existed prior to the speculative executions. If the processor
had a five-entry annex 50, the state of the annex
would appear as shown in Figure 6. 

A destination register compare port exists in the annex.
As an instruction is executed by a functional unit,
that instruction's destination register address is applied
to the compare port and is compared against the address
fields of all the entries in the annex. The comparisons
can all be done simultaneously if the address part
of the annex is implemented as content addressable
memory. For every annex entry that matches, the young
bit is reset. When the instruction's results are calculated,
they are entered into the annex with the young bit set.
In this way, only the youngest data for that particular
destination register which is newly produced from a
functional unit has an asserted young bit.

Using a first embodiment of the present invention, 
the restoration of the state of the young bits after a mispredicted
branch execution is performed sequentially
using the same hardware that was used to originally set
the young bits in the annex. The only significant modification
to the hardware datapath is that each annex entry
has the ability to drive the compare bus through the annex.

Once the entries of the annex corresponding to a 
mispredicted branch have been flushed by deasserting 
their respective valid bits, the recovery of the previous 
young bits is performed sequentially. The entry at the
end of the annex pipeline is the oldest entry in the annex.
Beginning with the oldest entry, the address in that entry
is applied to the compare bus and is compared to all the
other entries in the annex. The young bits of all entries
with matching destination addresses are cleared except
for the entry which is currently driving the compare bus. 
The young bit of the entry driving the bus is asserted. 

After the oldest entry has driven the compare bus 
and affected the state of the young bits, the second old- 
est drives the compare bus and affects the young bits. 
This process is performed sequentially until all of the 
valid entries have driven the compare bus. Once the
youngest valid entry has driven the compare bus, deasserted
the young bits of those entries having matching
destination addresses, and asserted its own young bit,
the state of the young bits prior to the speculative executions
has been completely reproduced. Those entries
which were invalidated by deasserting their valid bits do
not drive the compare bus. The young bits of those
invalid entries may be affected by the sequential reconstruction,
but this does not matter since essentially no
hardware will pay any attention to them because they 
are invalid. 
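
The sequential reconstruction can be sketched behaviorally in Python as shown below. The sketch is offered only as an illustration under stated assumptions: the Entry class and reconstruct_young_bits are hypothetical names, the last list index is taken to be the oldest entry, and the example annex contents are invented for the purpose and do not reproduce Figure 6.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    addr: int    # destination register address
    young: bool  # young bit
    valid: bool  # valid bit


def reconstruct_young_bits(annex: List[Entry]) -> int:
    """Rebuild the young bits oldest to youngest; return the number of steps."""
    steps = 0
    for driver in reversed(annex):  # the last index is the oldest entry
        if not driver.valid:
            continue                # flushed entries never drive the compare bus
        for e in annex:
            if e.addr == driver.addr:
                e.young = False     # clear every entry with a matching address
        driver.young = True         # then assert the driver's own young bit
        steps += 1
    return steps


# Hypothetical state just after the two youngest (speculative) results
# were flushed by deasserting their valid bits.
annex = [
    Entry(addr=17, young=True, valid=False),  # flushed speculative result
    Entry(addr=13, young=True, valid=False),  # flushed; it had cleared the bit below
    Entry(addr=10, young=True, valid=True),
    Entry(addr=13, young=False, valid=True),  # true youngest value for register 13
    Entry(addr=12, young=True, valid=True),   # oldest entry
]

steps = reconstruct_young_bits(annex)
print(steps, [e.young for e in annex if e.valid])  # 3 [True, True, True]
```

The young bit of the flushed entry for register 13 is cleared as a side effect, which is harmless because the entry is invalid.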

Since the reconstruction of the young bits was se- 
quential, the amount of time necessary to reconstruct 
grows as the number of remaining valid entries grows. 
This scheme has the advantage that repair to the annex 
can be made while the correct instructions are being 
fetched. The total number of pipeline processor stages 
prior to the first stage which refers to the annex defines 
the number of valid entries whose young bits may be 
reconstructed with no overhead. If that number of proc- 
essor stages preceding the functional unit is greater 
than or equal to the number of remaining valid entries 
in the annex, then the reconstruction of the young bits 
can always occur during the time when the correct instructions
are being fetched and processed in the pipelined
processor even if each individual step in the sequential
reconstruction takes one full clock cycle. Once
a misprediction is detected to have occurred, the processor
can fetch the correct instruction beginning in the
next cycle. However, that correct instruction will not
reach the functional unit until it has worked its way
through the preceding pipeline stages. Thus, the annex 
has that predetermined amount of time to reconstruct 
its young bits. 

One of the consequences of performing sequential 
recovery as described is that the sooner a condition is
resolved, the longer it takes to reconstruct the young bits
of the annex. This paradox is a result of the fact that if
the condition is quickly resolved, fewer of the annex
entries are invalidated, and more steps in the sequence
must be performed. In the worst case, if only one
speculative instruction is executed before the condition is
resolved, all but one of the annex entries must drive the
compare bus in order to reconstruct the young bits. Con- 
versely, if nearly all of the entries of the annex are inval- 
idated speculative instruction results, the reconstruction 
of the young bits takes relatively little time, since only 
the oldest few of the entries are valid, and only a few 
steps of the sequence need to be performed. 
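
The relationship between resolution latency and repair time described above can be illustrated with a short calculation. The following Python sketch is illustrative only: it assumes the annex is full when the misprediction is detected and that each broadcast on the compare bus takes one clock cycle. The function and parameter names are hypothetical and do not appear in the specification.

```python
def sequential_steps(annex_entries: int, invalidated: int) -> int:
    """Broadcast steps needed: one per entry that remains valid."""
    return annex_entries - invalidated


def stall_cycles(annex_entries: int, invalidated: int, front_end_stages: int) -> int:
    """Cycles of repair not hidden behind the front-end pipeline stages.

    Assumes one compare-bus broadcast per clock cycle and that repair
    starts in the same cycle the correct instruction is fetched.
    """
    steps = sequential_steps(annex_entries, invalidated)
    return max(0, steps - front_end_stages)
```

Under these assumptions, an eight-entry annex with one invalidated entry needs seven broadcast steps, while the same annex with six invalidated entries needs only two.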

Figure 7 illustrates the sequential method for restor- 
ing the state of the young bits. The start step 100 is 
reached when a condition has been resolved such that 
the speculative execution sequence was incorrect. At
step 110, the results of the speculative execution which
are resident in some of the youngest (leftmost) entries
of the annex are invalidated by deasserting their valid
bits. Step 120 is a test to determine if all the valid entries
have been acted upon. Once each remaining valid entry
has broadcast its address to all the other entries, thus
updating the state of the young bits, the restoration of
the young bits is complete. The steps of the sequential
restoration begin with the rightmost (oldest) entry of the
annex (illustrated in Figure 6) and proceed through the
leftmost (youngest) valid entry (illustrated in Figure 6).
The first invalid entry is one entry younger than the
youngest valid entry. When the first invalid entry is
reached, the restoration is complete and the completion
stage 130 is reached.

In step 140, the destination register address of the 
oldest valid entry that has not already been broadcast 
to every other entry is broadcast to every entry. In step 
150, for those annex entries whose destination address 
matches that of the address being broadcast, the young 
bits are deasserted or reset. The young bit of the entry 
whose address is being broadcast is asserted or set. 
The order in which the resetting and setting occurs is 
not important. 
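
The sequential method of Figure 7 can be sketched in software as follows. This is a minimal behavioral model, not the hardware of the specification: the annex is represented as a Python list ordered oldest first (the figures draw the oldest entry rightmost), the compare bus is modeled as a loop over all entries, and the names `AnnexEntry` and `restore_young_bits_sequential` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnnexEntry:
    dest: int            # destination register address
    valid: bool = True
    young: bool = False


def restore_young_bits_sequential(annex: List[AnnexEntry], num_speculative: int) -> None:
    """Repair young bits in place; annex[0] is the oldest entry."""
    # Step 110: invalidate the youngest entries (speculative results).
    first_invalid = max(0, len(annex) - num_speculative)
    for entry in annex[first_invalid:]:
        entry.valid = False

    # Steps 120-150: each valid entry, oldest first, drives the compare bus.
    for broadcaster in annex:
        if not broadcaster.valid:
            continue  # invalid entries do not drive the compare bus
        for entry in annex:
            if entry.dest == broadcaster.dest:
                entry.young = False   # step 150: reset matching young bits
        broadcaster.young = True      # step 150: set the broadcaster's own bit
```

In this model the young bits of invalid entries may also be reset by a matching broadcast, which mirrors the observation that their state does not matter.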

The pipelined processor architecture may be such 
that the number of entries in the annex is excessively 
large. Alternatively, there may be too few front-end pipe- 
line stages to provide enough time to sequentially re- 
construct the young bits. In either case, the sequential 
method of young bit reconstruction will cause delays in 
processing. In order to avoid such delays, the second 
embodiment of the present invention can be used. 

A second embodiment of the present invention al- 
lows reconstruction of the young bits when the branch 
conditions are resolved with low latency. In such cases, 
relatively few of the young entries of the annex need to 
be invalidated. This embodiment is most useful when 
there are some branch conditions which can be resolved 
in a small number of clock cycles. Figure 8 shows the 
organization of information in the annex 80 according to 
this second embodiment of the present invention in
which branch conditions are resolved no later than N
clock cycles after the speculative execution has begun. 

Each annex entry 81 has a destination address field 
82 to which the contents of the data field 83 will be stored
for each valid entry by the write-back stage at the end
of the annex pipeline 80. The valid bit 84 in each entry
specifies whether or not that entry is valid, and thus
should actually be stored to the destination register in
the write-back stage at the end of the annex pipeline 80.
Each annex entry has a current young bit 85; this field
is labeled Y0. Additionally, each annex entry 81 has a
previous young bit field 86 labeled Y1. The operation of a
system using the second embodiment will be described
where N is two. If N = 2, the system is designed to handle
two clock cycle condition resolution. Thus, only two
additional bits (Y1 and Y2) per entry are added to the
annex 80. The bit field labeled Y1 86 is the previous state
of the young bits during the last clock cycle. The bit field
labeled Y2 87 is the next older previous state of the
young bits that existed two cycles prior.

Every time an instruction is executed by a functional
unit, the destination register address for that instruction
is compared to all of the destination register addresses
stored in the address field 82 of the annex entries 80.
Only when there is a match or hit in the annex 80 does
one of the previously existing current young bits 85
stored in the Y0 bit position change in the annex 80. In
this case, when a destination register address match
causes one of the Y0 bits in the annex 80 to change
from assertion to deassertion, the previous state of the
Y0 bit fields 85 is stored into the Y1 bit positions of
each entry 81 in the annex 80. When there is not a match
in the annex of the currently executing instruction's
destination register address, the state of the Y0 bits 85 is
stored into Y1 86 even though they are identical to those
in Y0 85. The fact that Y0 and Y1 are identical is a result
of the fact that if a speculative execution needs to be
voided and there was no destination register address
match for that speculatively executed instruction, the
previous state of the young bits is equal to the current
state for those entries in the annex 80 that remain valid.
Thus, the values in Y0 85 are stored into Y1 86 every
clock cycle. Similarly, the young bit values in Y1 86 are
stored into Y2 87 during each clock cycle. Therefore, Y0
85 contains the current state of the young bits, Y1 86
contains the state of the young bits one clock cycle ago,
and Y2 87 contains the state of the young bits two clock
cycles ago.

Whenever a mispredicted branch occurs, some of
the entries 81 in the annex 80 are invalidated, and the
state of the young bits 85 must be restored. If the branch
is resolved in the next clock cycle, the Y1 young bits 86
are copied into the Y0 position 85 for each entry 81 in
the annex 80. If it takes the full two cycles to resolve the
branch, the Y2 young bits 87 are copied into the Y0 bit
position 85 to restore the current Y0 young bits 85 to
their correct values.

In this implementation, one of the premises of the
design was that conditional branches could be resolved
within two clock cycles. Therefore, no more than two 
previous states of the young bit 85 for each entry 81 
need to be stored, because all branches will be resolved 
by at most two clock cycles. In this example, at most two 
entries 81 in the annex 80 will be invalidated because 
the branch resolution latency is at most two cycles. 

The above methodology of the second embodiment 
can be extended to processor implementations in which 
up to N cycles are required to resolve conditions. If up 
to N cycles are required to resolve conditions, N young 
bit fields need to be added to each entry 81 of the annex 
80. Since the N speculatively executed instructions 
could each store to a destination register address which 
is already in the annex 80, it is possible that an N-deep
stack of previous young bit values needs to be stored in
each annex entry 81 in order to guarantee that the
appropriate state of the young bits is available for
restoration. Thus, each annex entry 81 would have the current
young bit field Y0 85 as well as previous young bits Y1,
Y2, ... YN 88.

Figure 9 illustrates the method of maintaining and 
restoring the young bits according to the second embod- 
iment. Step 200 indicates the initial entry point into the 
infinite loop which will be performed as long as the proc- 
essor is running. In step 210, the current state of the 
young bits (Y0) is stored into the position for the previous
young bits Y1. Simultaneously, the previous young bits 
Y1 are shifted into the Y2 position. Similarly, the bits in 
Y2 are shifted into Y3. Since the oldest state of the 
young bits to be stored is YN, YN can not be stored into 
YN+1. Instead, YN is discarded. However, the discard- 
ing of the YN bits is not problematic, because even if the 
instruction which produced the entry N cycles ago was 
speculatively executed, the contingency has been re- 
solved by that time. 

In step 220, a new entry is received into the annex 
pipeline from a functional unit. In step 230, the current 
young bits Y0 are updated. The newest entry which just
entered the annex always has its young bit set. The 
young bits of any entries having the same destination 
register address as the newest entry are reset. 

In step 240, a test is performed to see if a mispredicted
branch has been encountered. If no mispredicted
branch is detected, the process reverts back to step 210.
If a mispredicted branch occurs, the process flow
proceeds to step 250. Here, i indicates the number of
entries resident in the annex which were the result of
speculatively executed instructions. If all instructions result
in an annex entry, this means the conditional branch
which is just now being resolved occurred i clock cycles
ago. The i speculatively executed instructions are
invalidated by resetting their valid bits. At step 260, the
correct young bits are restored by copying young bits from
the position Yi into the Y0 position for each entry in the
annex. The process then reverts back to step 210.
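
The loop of Figure 9 can be modeled in a few lines of Python. The sketch below is a behavioral illustration under simplifying assumptions: exactly one instruction enters the annex per clock cycle, entries never retire from the annex, and 1 <= i <= N at recovery. The class and method names (`HistoryAnnex`, `cycle`, `recover`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    dest: int      # destination register address
    y: List[bool]  # y[0] is the current young bit, y[k] its value k cycles ago
    valid: bool = True


class HistoryAnnex:
    def __init__(self, n: int):
        self.n = n                      # number of past young bits kept per entry
        self.entries: List[Entry] = []  # oldest first

    def cycle(self, dest: int) -> None:
        # Step 210: shift Y[x] into Y[x+1]; the old Y[N] is discarded.
        for e in self.entries:
            e.y = [e.y[0]] + e.y[:-1]
        # Step 220: a new entry arrives from a functional unit.
        new = Entry(dest, [False] * (self.n + 1))
        # Step 230: reset matching current young bits, set the newest one.
        for e in self.entries:
            if e.dest == dest:
                e.y[0] = False
        new.y[0] = True
        self.entries.append(new)

    def recover(self, i: int) -> None:
        # Step 250: invalidate the i youngest (speculative) entries.
        for e in self.entries[len(self.entries) - i:]:
            e.valid = False
        # Step 260: copy Y[i] into Y[0] for every entry.
        for e in self.entries:
            e.y[0] = e.y[i]
```

For example, with N = 2, writing registers 3 and 5 and then speculatively writing 3 and 5 again, followed by `recover(2)`, leaves the two original entries valid with their young bits set, as they were before speculation began.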

For a given architecture, the restoration of the
young bits using this second embodiment requires a
constant amount of time regardless of the number of
invalidated speculatively executed results.
Furthermore, this constant amount of time is small since
essentially the only significant operation is the transferring
of the bit values from position Yi into position Y0.

In the extreme case in which all conditions were
known to be resolved in only one clock cycle, then only
one additional young bit Y1 per entry would need to be
supplemented to the current young bit Y0. This is the
case, for example, for a predicated instruction, such as
a conditional move. A predicated instruction conditionally
executes based upon some existing state in the machine.
For instance, it might execute only if a certain condition
code exists in some special status registers of the
processor. In that case, there is no separate branch
instruction in the instruction stream, but instead there is an
implied condition in an individual instruction.

In the case that more is known about the resolution
time for branches and the nature of the sequence of
instructions, other optimizations can be made. For example,
if it is known exactly how many cycles every condition
will require to be resolved and that no additional
branches will be acted upon by the functional unit during
this resolution time, only one additional set of young bits
needs to be provided in each annex entry. The copying
of the current young bit state into the previous young bit
position would only occur directly before the speculative
results entered the annex. As another example, if the
maximum number of destination address matches
(those instructions that affect the current state of the
young bits) that could occur during the resolution of
a condition was known to be less than the number of
cycles required for condition resolution, then the number
of extra young bits provided for storage in each annex
entry could be reduced accordingly at the expense of
slightly more complicated control logic.

In a typical processor architecture, it might be prudent
to implement both embodiments simultaneously.
The sequential method of the first embodiment is faster
when the condition resolution latency is higher, resulting
in relatively more invalidated entries and relatively fewer
remaining valid entries. The second embodiment using
supplemental young bits is most efficient when the
condition resolution latency is lower, resulting in relatively
fewer invalidated entries and relatively more valid
entries. In order to handle graceful recovery from both
short and long mispredicted branches without stalling
the processor or increasing the hardware costs very
much, it may be advantageous to use the embodiments
as alternatives to each other in the same annex structure.
The sequential method would be used for longer
mispredicted branches, and the supplemental young bit
method would be used for shorter mispredicted branches.
The recovery mechanism utilized would be dynamically
controlled dependent upon the type of branch or
predicated instruction or statically controlled based upon
the nature of the recovery.
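
One possible selection policy for such a combined implementation is sketched below. It is an assumption made for illustration, not a policy stated in the specification: the supplemental young bits are used whenever the number of cycles since speculation began does not exceed the number of past young bit fields stored per entry, and the sequential method is used otherwise. The function name and return values are hypothetical.

```python
def choose_recovery(resolution_cycles: int, history_depth: int) -> str:
    """Pick a young-bit repair mechanism for one mispredicted branch.

    resolution_cycles: cycles between the start of speculation and resolution.
    history_depth:     number of past young bit fields (Y1..YN) per entry.
    """
    if resolution_cycles <= history_depth:
        return "history"      # copy Y[i] into Y[0] in constant time
    return "sequential"       # rebroadcast remaining valid entries, oldest first
```

With two past young bit fields per entry, a branch resolved after one or two cycles would take the constant-time path, and a branch resolved after five cycles would fall back to the sequential path.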

While the method and apparatus of the present
invention has been described in terms of its presently
preferred and alternate embodiments, those skilled in the
art will recognize that the present invention may be
practiced with modification and alteration within the spirit and
scope of the appended claims. The specifications and 
drawings are, accordingly, to be regarded in an illustra- 
tive rather than a restrictive sense. 



Claims 

1. A method of processing a young bit value in an annex
having a plurality of entries, wherein each entry
includes address, data, and young bit information, 
comprising steps of: 

comparing a first destination address of a first 
annex entry to a second destination address of 
a second annex entry; and 
resetting the young bit in the second annex en- 
try if the first destination address and the sec- 
ond destination address are equal. 

2. A method as in claim 1, further comprising the step
of: 

setting the young bit in the first annex entry. 

3. A method of generating young bit values in an an- 
nex having a plurality of entries, wherein each entry 
includes address, data, and young bit information, 
comprising steps of: 

comparing a first destination address of a first 
annex entry to every other destination address 
of every other annex entry; and 
resetting the young bit in each annex entry in 
which the first destination address equals the 
destination address in that annex entry. 

4. A method as in claim 3, further comprising the step
of: 

setting the young bit in the first annex entry. 

5. A method as in claim 4, wherein the comparisons 
of the first destination address of the first annex entry
to every other destination address of every other
annex entry are performed simultaneously. 

6. A method as in claim 5, wherein the first destination 
address of the first annex entry is broadcast to every
other annex entry for comparison. 

7. A method of generating young bit values in an an- 
nex having a plurality of entries, wherein each entry 
includes address, data, and young bit information, 
comprising steps of: 

(a) broadcasting to every other valid annex entry
for comparison a destination address of an
oldest valid annex entry which has not previously
been broadcast;

(b) comparing the destination address of the
oldest valid annex entry which has not previously
been broadcast to every other destination
address of every other valid annex entry;

(c) resetting the young bit in each valid annex
entry in which the destination address of the
oldest valid annex entry which has not previously
been broadcast equals the destination
address in that valid annex entry;

(d) setting the young bit in the oldest valid annex
entry whose destination address has not
previously been broadcast; and

(e) repeating steps a through d once for each
valid annex entry ending with the youngest valid
annex entry.

8. A method as in claim 7, wherein step a further
comprises:

broadcasting to every invalid annex entry for
comparison the destination address of the oldest
valid annex entry which has not previously been
broadcast.

9. A method as in claim 7, wherein step b further com- 
prises: 

comparing the destination address of the oldest
valid annex entry which has not previously been
broadcast to every destination address of every
invalid annex entry.

10. A method as in claim 7, wherein step c further
comprises:

resetting the young bit in each invalid annex
entry in which the destination address of the oldest
valid annex entry which has not previously been
broadcast equals the destination address in that
invalid annex entry.

11. A method of maintaining young bit information in an
annex having a plurality of entries, wherein each entry
has a current young bit storage position and a
past young bit storage position, comprising steps of:

storing contents of the current young bit storage
position into a past young bit storage position
prior to the receiving of a new entry into the
annex; and

copying the contents of the past young bit storage
position into the current young bit storage
position when a mispredicted branch has been
detected.

12. A method as in claim 11, further comprising the step
of:

updating the current young bit after a new entry
has been received into the annex.

13. A method of maintaining young bit information in an
annex having a plurality of entries, wherein each entry
has a current young bit storage position (Y[0])
and N past young bit storage positions (Y[1] through
Y[N]), comprising steps of:

for each x, wherein x is between 0 and N-1,
inclusive, storing the contents of Y[x] into Y[x+1]
for each annex entry prior to the receiving of a
new entry into the annex; and
copying the contents of one of the past young
bit storage positions into Y[0] for each annex
entry when a mispredicted branch has been
detected.

14. A method as in claim 13, wherein in the copying
step, Y[i] is the past young bit storage position copied
into Y[0], where i is a number of annex entries
invalidated by detection of the mispredicted branch.

15. An apparatus for processing a young bit value in an 
annex having a plurality of entries, wherein each entry
includes address, data, and young bit information,
comprising:

a compare circuit for comparing a first destination
address of a first annex entry to a second
destination address of a second annex entry; 
and 

a deassertion circuit for resetting the young bit 
in the second annex entry if the first destination 
address and the second destination address 
are equal. 

16. An apparatus as in claim 15, further comprising: 

an assertion circuit for setting the young bit in
the first annex entry. 

17. An apparatus for generating young bit values in an 
annex having a plurality of entries, wherein each entry
includes address, data, and young bit information,
comprising:

a compare circuit for comparing a first destina- 
tion address of a first annex entry to every other 
destination address of every other annex entry; 
and 

a deassertion circuit for resetting the young bit 
in each annex entry in which the first destina- 
tion address equals the destination address in 
that annex entry. 

18. An apparatus as in claim 17, further comprising:

an assertion circuit for setting the young bit in 
the first annex entry. 



19. An apparatus as in claim 18, wherein the compare 
circuit for comparing the first destination address of 
the first annex entry to every other destination address
of every other annex entry allows simultaneous
comparison.

20. An apparatus as in claim 19, wherein the first des- 
tination address of the first annex entry is physically 
routed via a bus to every other annex entry for
comparison.

21. An apparatus for generating young bit values in an 
annex having a plurality of entries, wherein each entry
includes address, data, and young bit information,
comprising:

(a) a broadcast circuit for broadcasting to every
other valid annex entry for comparison a destination
address of an oldest valid annex entry
which has not previously been broadcast;

(b) a compare circuit for comparing the destination
address of the oldest valid annex entry
which has not previously been broadcast to
every other destination address of every other
valid annex entry;

(c) a deassertion circuit for resetting the young
bit in each valid annex entry in which the destination
address of the oldest valid annex entry
which has not previously been broadcast
equals the destination address in that valid annex
entry;

(d) an assertion circuit for setting the young bit
in the oldest valid annex entry whose destination
address has not previously been broadcast; and

(e) a repetition circuit for repeating steps a
through d once for each valid annex entry ending
with the youngest valid annex entry.

22. An apparatus as in claim 21, wherein component a
further comprises:

a broadcast circuit for broadcasting to every
invalid annex entry for comparison the destination
address of the oldest valid annex entry which has
not previously been broadcast.

23. An apparatus as in claim 21, wherein component b
further comprises:

a compare circuit for comparing the destination
address of the oldest valid annex entry which
has not previously been broadcast to every destination
address of every invalid annex entry.

24. An apparatus as in claim 21, wherein component c
further comprises:

a deassertion circuit for resetting the young
bit in each invalid annex entry in which the destination
address of the oldest valid annex entry which
has not previously been broadcast equals the destination
address in that invalid annex entry.

25. An apparatus for maintaining young bit information
in an annex having a plurality of entries, wherein
each entry has a current young bit storage position
and a past young bit storage position, comprising:

a storage circuit for storing contents of the current
young bit storage position into a past young
bit storage position prior to the receiving of a
new entry into the annex; and
a copy circuit for copying the contents of the
past young bit storage position into the current
young bit storage position when a mispredicted
branch has been detected.

26. An apparatus as in claim 25, further comprising:

an update circuit for updating the current
young bit after a new entry has been received into
the annex.

27. An apparatus for maintaining young bit information
in an annex having a plurality of entries, wherein
each entry has a current young bit storage position
(Y[0]) and N past young bit storage positions (Y[1]
through Y[N]), comprising:

for each x, wherein x is between 0 and N-1,
inclusive, a storage circuit for storing the contents
of Y[x] into Y[x+1] for each annex entry prior to
the receiving of a new entry into the annex; and
a copy circuit for copying the contents of one of
the past young bit storage positions into Y[0]
for each annex entry when a mispredicted
branch has been detected.

28. An apparatus as in claim 27, wherein in the copy
circuit for copying, Y[i] is the past young bit storage
position copied into Y[0], where i is a number of annex
entries invalidated by detection of the mispredicted
branch.

29. An apparatus as in claim 19, wherein the compare
circuit for comparing is a content addressable memory.

30. An apparatus as in claim 23, wherein the compare
circuit for comparing is a content addressable memory.